Contents:
3.2 Counting.
3.3 Measurement.
Most data are commonly organized in tabular format, that is, tables. When data is in tabular format, cases are organized in rows, while variables (information about the cases) are organized in columns. Almost every data you have used in a spreadsheet follows that structure.
For example, when you visit the website of the Common Core of Data from the US Department of Education, you can get a data set with detailed information on public schools at the state of Washington. Let me get a data table I have based on that:
link='https://github.com/EvansDataScience/VisualAnalytics_2_tabularData/raw/master/data/eduwa.rda'
#getting the data TABLE from the file in the cloud:
load(file=url(link))
When you are in RStudio, you can view the data table by clicking on its name at the Environment .
It also good to know how much info you have:
#number of rows and columns
dim(eduwa) #nrow(eduwa) ncol(eduwa)
## [1] 2427 24
This is the list of the 24 columns:
names(eduwa)
## [1] "NCES.School.ID" "State.School.ID"
## [3] "NCES.District.ID" "State.District.ID"
## [5] "Low.Grade" "High.Grade"
## [7] "School.Name" "District"
## [9] "County" "Street.Address"
## [11] "City" "State"
## [13] "ZIP" "ZIP.4-digit"
## [15] "Phone" "Locale.Code"
## [17] "LocaleType" "LocaleSub"
## [19] "Charter" "Title.I.School"
## [21] "Title.1.School.Wide" "Student.Teacher.Ratio"
## [23] "Free.Lunch" "Reduced.Lunch"
When dealing with tabular data, you can suspect that you can produce a visualization for each column, and then for a couple of them simultaneously, and then for three or more.
In this material, we will pay attention to the univariate case; which is common for searching problems or veryfing outcomes; not for giving explanations. Then, when dealing with univariate data, you need to be aware of two things: what question you are trying to answer; and how to treat a particular variable to build the plot that will answer that question.
I can not anticipate all the questions you can try to answer via plots; but I can tell you that you are always limited by the nature of the variables you have at hand. Generally speaking, you have either categorical or numerical data in each column, and whatever question you have, you first need to know how that variable you are planing to use has been encoded, so you can plan the treatment. In R, we can know that like this:
# this 'width = 70,strict.width='cut' means
# you do not want to see more than 70 characters per row.
str(eduwa,width = 70,strict.width='cut')
## 'data.frame': 2427 obs. of 24 variables:
## $ NCES.School.ID : chr "530486002475" "530270001270" "53091"..
## $ State.School.ID : chr "WA-31025-1656" "WA-06114-1646" "WA-"..
## $ NCES.District.ID : chr "5304860" "5302700" "5309100" "53000"..
## $ State.District.ID : chr "WA-31025" "WA-06114" "WA-34033" "WA"..
## $ Low.Grade : Ord.factor w/ 14 levels "PK"<"KG"<"1"<..: ..
## $ High.Grade : Ord.factor w/ 15 levels "PK"<"KG"<"1"<..: ..
## $ School.Name : chr "10th Street School" "49th Street Ac"..
## $ District : chr "Marysville School District" "Evergr"..
## $ County : chr "Snohomish" "Clark" "Thurston" "Gray"..
## $ Street.Address : chr "7204 27th Ave NE" "14619B NE 49th S"..
## $ City : chr "Marysville" "Vancouver" "Tumwater" "..
## $ State : chr "WA" "WA" "WA" "WA" ...
## $ ZIP : chr "98271" "98682" "98512" "98520" ...
## $ ZIP.4-digit : chr NA "6308" NA "5510" ...
## $ Phone : chr "(360)965-0400" "(360)604-6700" "(36"..
## $ Locale.Code : Factor w/ 12 levels "11","12","13",..: 5 2..
## $ LocaleType : Factor w/ 4 levels "City","Rural",..: 3 1 ..
## $ LocaleSub : Factor w/ 12 levels "City: Small",..: 5 2 ..
## $ Charter : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1..
## $ Title.I.School : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1..
## $ Title.1.School.Wide : Factor w/ 2 levels "No","Yes": 2 NA NA 2 N..
## $ Student.Teacher.Ratio: num 23.4 8.4 21.5 15.9 6.5 15.3 NA 16.3 1..
## $ Free.Lunch : num 28 53 169 292 12 411 48 102 101 268 ...
## $ Reduced.Lunch : num 3 9 40 10 4 23 12 22 23 0 ...
The ones that say num are obviously numbers (numbers in R are numeric when decimal values are detected, and integer if they are not). The ones that say chr are strings, which are candidates to be key columns, which are not variables themselves, but identifiers of the cases. In this case, the first four are identifiers, as well as the the 7th, 10th and 15th columns (school names, address and phone, respectively). Those variables are not to be analyzed statistically, but may be used for annotating (7th and 15th column) or for geocoding (10th column). Notice that for these data, State is not to be analyzed as it is a constant (all rows are from WA); but it would be if the data were from the whole USA. Then, you see several variables identified as factor or ordered factor, which are categorical variables: they can be analyzed statistically but not in the same way that numbers.
Data is obtained via different processes. When you observe reality, you can classsify, count or measure. Each of these decisions produces data with some basic characteristics; which are represented via categories or numerical values.
Categorical data are the output of the classification process. The classification can propose an incremental or non-incremental differentiation. The former are named ordinal data and the latter nominal data. A nominal classification related to education can be type of school funding: public or private; while an ordinal one can be: elementary, middle, high, college and graduate school level.
Let’s see some raw values in the variable LocaleType:
head(eduwa$LocaleType,50) #first fifty values
## [1] Suburb City City Town City City Suburb City Rural Rural
## [11] City City City City Suburb Rural Rural Rural Suburb City
## [21] Suburb Suburb Suburb Suburb City City Suburb Suburb Rural Rural
## [31] Rural Suburb City City Rural City City Rural Rural City
## [41] Town Town Rural City Suburb City City Rural City City
## Levels: City Rural Suburb Town
You can not get a clear idea of what a data table has, so a simple frequency table is the first tool to see what these nominal data are telling us:
# absolute values
table(eduwa$LocaleType,exclude = 'nothing')
##
## City Rural Suburb Town <NA>
## 714 505 798 338 72
# relative values
absoluteT=table(eduwa$LocaleType,exclude = 'nothing')
prop.table(absoluteT)
##
## City Rural Suburb Town <NA>
## 0.29419036 0.20807581 0.32880099 0.13926658 0.02966625
This table tells us the location of the public schools. What is the right visual for this? Sometimes the answer seems obvious, as tradition or habits give so much weight to decisions. Let’s use the very well known pie chart:
# the pie plots the table:
ToPlot=prop.table(absoluteT)
pie(ToPlot)
You should always keep it simple. Then decorate. For example, you can start improving the plot you already have:
names(ToPlot)
## [1] "City" "Rural" "Suburb" "Town" NA
We could alter the fifth label:
names(ToPlot)[5]='Unknown'
# the pie plots the table:
titleText='Where are Public Schools located in WA in 2019?'
sourceText='Source: US Department of Education'
pie(ToPlot,
main=titleText,
sub=sourceText)
The title can guide the reader to recognise the purpose of your plot:
# the pie plots the table:
titleText2='WA still has schools locations unknown \n (info from 2018)'
pie(ToPlot,
main=titleText2,
sub=sourceText)
The title can also suggest the decision:
# the pie plots the table:
titleText3='WA needs to fully categorize school locations\n(info from 2018)'
pie(ToPlot,
main=titleText3,
sub=sourceText)
Titles are no that easy to produce. You need to rewrite them many times, until you find a good combination of words that can be read in less than ten seconds. Is also good to keep in mind that you must never give your audience a cacophonous version (a tongue twister), and neither should you include adjectives.
A general rule for any plot is to make it reachable for the audience you are writing to (read this every day).
Let’s do more customization:
ToPlot*100
## City Rural Suburb Town Unknown
## 29.419036 20.807581 32.880099 13.926658 2.966625
# details:
ToPlot=ToPlot*100 # preparing labels
paletteHere=rainbow(length(ToPlot)) # customizing set of colors
# plotting
pie(x=ToPlot,#table
col = paletteHere,
labels = ToPlot,
main=titleText,
sub=sourceText)
The labels need better work:
paste0(round(ToPlot,2),'%')
## [1] "29.42%" "20.81%" "32.88%" "13.93%" "2.97%"
Then,
plotLabels=paste0(round(ToPlot,2),'%') # labels for the slices
# plotting
pie(x=ToPlot,#table
col = paletteHere,
labels = plotLabels,
main=titleText,
sub=sourceText)
# plotting
pie(x=ToPlot,#table
col = paletteHere,
labels = plotLabels,
main=titleText,
sub=sourceText)
#legend
legend(x="right", #where
legend=names(ToPlot), #text
fill = paletteHere) #symbols' colors
Most legends need customization:
# MANAGING THE LEGEND:
pie(x=ToPlot,#table
col = paletteHere,
labels = plotLabels,
main=titleText,
sub=sourceText)
#legend
legend(x="right", #where
legend=names(ToPlot), #text
fill = paletteHere,
bty = 'n', # no box
cex = 0.5 # shrink
) #symbols' colors
Most people tend to use pie charts with categorical data, but this should not be the default option to visualize classification (see this discussion). Following the advice from the video in that post, we should turn our pie into a bar chart. Let me do it with the same info I used to build the pie:
# barplot plots the table too
barplot(ToPlot,
col = paletteHere,
main=titleText,
sub=sourceText)
We saved some space as no legend was needed (the less ink the better visual). Speaking of saving, we can get rid of the colors (they were needed to differentiate the slices).
A very important thing to consider is that axes represent some unit of measurement, so make sure that unit is shown:
paletteHereNew=c('gray') # just one color
# plotting
barplot(ToPlot,
col = paletteHereNew,
border=NA, #no border
main=titleText,
sub=sourceText,
ylim=c(0,50),
ylab = '(in %)' # show unit
)
If you consider annotating the plot, you can use the labels we used before:
# plotting
location=barplot(ToPlot,
col = paletteHereNew,
border=NA,
main=titleText,
sub=sourceText,
ylim=c(0,50),
ylab = '(in %)')
#annotating
text(x=location,y=ToPlot,labels=plotLabels,
pos = 1,# if pos=3, text will be on top of bar
cex = 0.8)
You may decide to change the orientation of the plot:
# plotting
location=barplot(ToPlot,
col = paletteHereNew,
border=NA,
main=titleText,
sub=sourceText,
ylim=c(0,50),
ylab = '(in %)',
horiz = T) # ORIENTATION
#annotating
text(x=location,y=ToPlot,labels=plotLabels,
pos = 1) # this is the position of the label
The problem above is that changing the orientation, changes the axes. Then, we need to do more work:
location=barplot(ToPlot,
col = paletteHereNew,
border=NA,
main=titleText,
sub=sourceText,
xlim=c(0,50), #change to xlim
xlab = '(in %)', #change to xlab
horiz = T)
#annotating
text(x=ToPlot,y=location, #change of x and y
labels=plotLabels,
pos = 3) # change position of the label
A little more work on the categories names:
location=barplot(ToPlot,
col = paletteHereNew,
border=NA,
main=titleText,
sub=sourceText,
cex.names = 0.7, #shrink category names
xlim=c(0,50),
xlab = '(in %)',
horiz = T)
#annotating
text(x=ToPlot,y=location,labels=plotLabels,pos = 4)
We made the right changes, but some things do not look well. It would be better if:
location=barplot(ToPlot,
col = paletteHereNew,
border=NA,
main=titleText, # no sub here!
xlim=c(0,50),
cex.names = 0.5,
xlab = '(in %)',
horiz = T)
# annotating
text(x=ToPlot,y=location,labels=plotLabels,pos = 4)
# subtitle
title(sub=sourceText,
adj=0,#adj=1 aligns to rigth.
cex.sub=0.7) #shrinking text
# changing parameters
# (distanceOfUnit To plot,
# distanceOfAxislabels to plot,
# distance ofTicks to plot)
# default is: mgp=c(3, 1, 0)
par(mgp=c(0.5,0.5,0))
#####
location=barplot(ToPlot,
col = paletteHereNew,
border=NA,
main=titleText,
xlim=c(0,50),
xlab = '(in %)',
cex.names = 0.6,
cex.lab=0.6, # shrinking label text
horiz = T)
text(x=ToPlot,y=location,labels=plotLabels,pos = 4)
title(sub=sourceText, adj=0,cex.sub=0.7,
line = 3) #push the text down
titleText2='Are all locations getting a fair share of public schools in WA?'
par(mgp=c(1,0.5,0))
location=barplot(ToPlot,
col = paletteHereNew,
border=NA,
main=titleText2,
xlim=c(0,50),
cex.names = 0.6,
cex.lab=0.6,
xlab = '(in %)',
horiz = T
)
text(x=ToPlot,y=location,labels=plotLabels,pos = 4)
title(sub=sourceText, adj=0,cex.sub=0.7,
line = 3)
# reference line
abline(v=25,#position vertical
lty=3,#type
lwd=3)#width
Again, adding another element requires adjusting what we had. What about writing your own axis-values and reducing the bar annotations:
par(mgp=c(1,0.5,0))
location=barplot(ToPlot,
col = paletteHereNew,
border=NA,
main=titleText2,
xlim=c(0,50),
xlab = '(in %)',
cex.names=0.6,
cex.lab=0.6,
las=2,
horiz = T,
xaxt="n") # no x-axis, so I customize it below...
text(x=ToPlot,y=location,labels=plotLabels,pos = 4,cex = 0.7)
title(sub=sourceText, adj=0,cex.sub=0.7,line = 3)
#reference line
abline(v=25,lty=3,lwd=3)
# customizing tick values
newXvalues<-c(0,10,25,40,50) # you just want to show this on the axis
axis(side=1,
at=newXvalues,
labels = newXvalues,
cex.axis=0.8)
So far, we have used the basic R capabilities for plotting.
There are alternative libraries, like ggplot2, that are also frequently used. However, it has a different approach, which allows to add layers that let you customize your plot. The classic approach for ggplot is:
tableFreq=as.data.frame(ToPlot)
names(tableFreq)=c("locale","pct")
#you have:
tableFreq
## locale pct
## 1 City 29.419036
## 2 Rural 20.807581
## 3 Suburb 32.880099
## 4 Town 13.926658
## 5 Unknown 2.966625
library(ggplot2)
#base GGPLOT2 starts with a "base", telling WHAT VARIABLES TO PLOT
base= ggplot(data = tableFreq,
aes(x = locale,
y = pct))
plot1 = base + geom_bar(fill ="gray",
stat = 'identity') # y is just what it is!
plot1
plot2 = plot1 + labs(title=titleText2,
x =NULL,
y = NULL,
caption = sourceText)
plot2
plot3 = plot2 + geom_hline(yintercept = 25, #where
linetype="dashed",
size=1.5, #thickness
alpha=0.5) #transparency
plot3
library(scales)
# customize Y axis
plot4 = plot3 + scale_y_continuous(breaks=c(0,10, 25,40,50),
limits = c(0, 50), # expand = c(0, 0),
labels=scales::unit_format(suffix = '%'))
plot4
plot5 = plot4 + theme(panel.background = element_rect(fill = "white",
colour = "grey50"),
plot.caption = element_text(hjust = 0), # default was 1
plot.title = element_text(hjust = 0.5))
plot5
plot6 = plot5 + geom_text(aes(y = pct ,
label = paste0(round(pct,2), '%')),
vjust=1, # if flipping 'hjust'
size = 3)
# wanna flip the plot?
plot6 #+ coord_flip()
Bar plots are the default option for categorical variables. In general, you see the distribution of the classification, which allows you to identify concentration. For that reason, ordering the bars by height can be helpful:
###
ToPlotOrd=sort(ToPlot)
###
par(mgp=c(1,0.5,0)) # distance label, tickText,tick
location=barplot(ToPlotOrd,
col = paletteHereNew,
border=NA,
main=titleText2,
xlim=c(0,50),
xlab = '(in %)',
horiz = T,
cex.names = 0.7,
cex.lab=0.6,
xaxt="n") # no x-axis, so I customize it below...
text(x=ToPlotOrd,y=location,labels=plotLabels,pos = 2,cex = 0.7)
title(sub=sourceText, adj=0,cex.sub=0.7,line = 3)
# reference line
abline(v=25,lty=3,lwd=3)
# customizong tick values
xtick<-c(0,10,25,40,50)
axis(side=1, at=xtick, labels = xtick,cex.axis=0.8)
The plot above simply change the order of the table. If you want to do the same with ggplot you should try the command:
tableFreq[order(-tableFreq$pct),]
## locale pct
## 3 Suburb 32.880099
## 1 City 29.419036
## 2 Rural 20.807581
## 4 Town 13.926658
## 5 Unknown 2.966625
Exercise:
Use ggplot to show a bar plot ordered by share size.
We could use our reference line to show gaps or differences. In this case, the Lollipop plot may be useful. This one is just a replacement for a bar plot:
base = ggplot(tableFreq, aes(x=locale,pct))
lolliplot1=base + geom_segment(aes(y = 0,
x = locale,
yend = pct,
xend = locale), color = "grey50")
lolliplot1 + geom_point()
And, if you order the data frame:
tableFreq[order(tableFreq$pct),]
## locale pct
## 5 Unknown 2.966625
## 4 Town 13.926658
## 2 Rural 20.807581
## 1 City 29.419036
## 3 Suburb 32.880099
You can get:
# reordering DF steps:
tableFreqO=tableFreq[order(tableFreq$pct),]
base = ggplot(tableFreqO, aes(locale,pct))
lolliplot1=base + geom_segment(aes(y = 0,
x = locale,
yend = pct,
xend = locale), color = "gray")
lolliplot2 = lolliplot1 + geom_point()
lolliplot2 + scale_x_discrete(limits=tableFreqO$locale) # key element
And, what about changing the axis values so that we can identify the gaps:
# new variable
tableFreqO$gap=tableFreqO$pct-25
# plot the new variable
base = ggplot(tableFreqO, aes(locale,gap))
lolliplot1=base + geom_segment(aes(y = 0,
x = locale,
yend = gap,
xend = locale), color = "gray")
lolliplot2 = lolliplot1 + geom_point()
lolliplot2 + scale_x_discrete(limits=tableFreqO$locale) # key element
Maybe add some color:
# a new column for color
tableFreqO$PositiveGap=ifelse(tableFreqO$gap>0,T,F)
# add new aesthetics 'color'
base = ggplot(tableFreqO, aes(locale,gap,
color=PositiveGap)) #change
lolliplot1=base + geom_segment(aes(y = 0,
x = locale,
yend = gap,
xend = locale), color = "gray")
lolliplot2 = lolliplot1 + geom_point()
lolliplot2 + scale_x_discrete(limits=tableFreqO$locale) # key element
Maybe add some extra info:
# a new column for color
tableFreqO$PositiveGap=ifelse(tableFreqO$gap>0,T,F)
base = ggplot(tableFreqO, aes(locale,gap,color=PositiveGap,
label = round(gap,3))) # change
lolliplot1=base + geom_segment(aes(y = 0,
x = locale,
yend = gap,
xend = locale), color = "gray")
lolliplot2=lolliplot1 + geom_point()
lolliplot3= lolliplot2 + scale_x_discrete(limits=tableFreqO$locale)
# annotating and moving the text on the horizontal
lolliplot3 + geom_text(nudge_x=0.3)
You can avoid the overlaping symbols in the legend by using:
lolliplot3 + geom_text(nudge_x=0.3,show.legend = FALSE)
Exercise:
Complete adding the elements missing in the last plot.
For this section, we will use the variable that tells us the highest grade offered in a school. A simple exploration gives:
table(eduwa$High.Grade,exclude = 'nothing')
##
## PK KG 1 2 3 4 5 6 7 8 9 10 11 12 13
## 82 7 6 16 19 45 755 266 11 427 15 7 5 757 9
Being a categorical variable, the default option is again the bar plot. So let’s prepare the frequency table as a data frame:
frqTabO=as.data.frame(prop.table(table(eduwa$High.Grade)))
names(frqTabO)=c('grade','pct')
frqTabO
## grade pct
## 1 PK 0.033786568
## 2 KG 0.002884219
## 3 1 0.002472188
## 4 2 0.006592501
## 5 3 0.007828595
## 6 4 0.018541409
## 7 5 0.311083642
## 8 6 0.109600330
## 9 7 0.004532344
## 10 8 0.175937371
## 11 9 0.006180470
## 12 10 0.002884219
## 13 11 0.002060157
## 14 12 0.311907705
## 15 13 0.003708282
Now, we can use ggplot:
base = ggplot(frqTabO,aes(x=grade,y=pct))
base + geom_bar(stat = 'identity')
The x-values in this variable have order. That is, there is an increasing level in the values. Whenever we have an ordering, besides concentration we can visualize symmetry: if there is bias towards lower or higher values.
Bar plots help you see concentration and symmetry, but we have an alternative way to clearly detect symmetry, via boxplots:
# boxplots do not use frequency tables
# as.numeric produces turns levels of the factor into numbers
box1 = ggplot(eduwa, aes(y=as.numeric(High.Grade)))
box1 = box1 + geom_boxplot() + coord_flip() # to show it horizontally
box1
You have symmetry when the distance of those whiskers to the box is the same, and when the thick line is in the middle of the box. You can see that the values show a negative asymmetry (tail to the left).
Box plots expect a numeric value as an input, but we have an ordered categorical, so we used the as.numeric() function. However, that eliminated the levels we saw in the previous bar plot; we can put the levels back in our plot:
# the labels use the original ordinal levels
ordLabels= levels(eduwa$High.Grade)
box2 = box1 + scale_y_continuous(labels=ordLabels,breaks=1:15)
box2
Box plots have important statistical information. The beginning and the ending of the box indicates the first (q1) and the third quantile (q75); and the thicker line in the middle represents the median. All those values are clearly visible, but we can retrieve the values like this:
#get positions
# using 'ggplot_build'
pos_q1= ggplot_build(box2)$data[[1]]$lower
pos_median= ggplot_build(box2)$data[[1]]$middle
pos_q3= ggplot_build(box2)$data[[1]]$upper
# using
levels(eduwa$High.Grade)[c(pos_q1,pos_median,pos_q3)]
## [1] "5" "8" "12"
From the information retrieved, we know:
We can find these results with a detailed frequency table; that is, instead of using the command table as we did before, we could try a more advanced function:
library(summarytools)
freq(eduwa$High.Grade,style = 'rmarkdown')
Variable: eduwa$High.Grade
Type: Factor (ordered)
| Â | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
|---|---|---|---|---|---|
| PK | 82 | 3.38 | 3.38 | 3.38 | 3.38 |
| KG | 7 | 0.29 | 3.67 | 0.29 | 3.67 |
| 1 | 6 | 0.25 | 3.91 | 0.25 | 3.91 |
| 2 | 16 | 0.66 | 4.57 | 0.66 | 4.57 |
| 3 | 19 | 0.78 | 5.36 | 0.78 | 5.36 |
| 4 | 45 | 1.85 | 7.21 | 1.85 | 7.21 |
| 5 | 755 | 31.11 | 38.32 | 31.11 | 38.32 |
| 6 | 266 | 10.96 | 49.28 | 10.96 | 49.28 |
| 7 | 11 | 0.45 | 49.73 | 0.45 | 49.73 |
| 8 | 427 | 17.59 | 67.33 | 17.59 | 67.33 |
| 9 | 15 | 0.62 | 67.94 | 0.62 | 67.94 |
| 10 | 7 | 0.29 | 68.23 | 0.29 | 68.23 |
| 11 | 5 | 0.21 | 68.44 | 0.21 | 68.44 |
| 12 | 757 | 31.19 | 99.63 | 31.19 | 99.63 |
| 13 | 9 | 0.37 | 100.00 | 0.37 | 100.00 |
| <NA> | 0 | 0.00 | 100.00 | ||
| Total | 2427 | 100.00 | 100.00 | 100.00 | 100.00 |
Exercise:
Make sure our box plot follows the same design approach and include all the elements as in the bar plot for nominal data.
Counting expresses numerical values. They could be represented with bar plots if their frequency table had few different values. For example, the variable Reduced.Lunch informs how many kids there are in each school that have that lunch for a reduced price.
# how many unique values
length(unique(eduwa$Reduced.Lunch))
## [1] 172
There are too many different values. Then, the bar plot is not a good idea (and neither a frequency table):
barplot(table(eduwa$Reduced.Lunch),las=2,cex.names = 0.3,
main='bad idea')
On the other hand, when we have a numerical variable, there are more statistical values that help understand its behavior:
# median close to mean?
# median and mean far from max or min?
# q1 distance to min is similar ti q3 distance to max?
# how many missing?
summary(eduwa$Reduced.Lunch)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.00 25.50 33.53 47.00 301.00 131
The bar plot produces a bar for each unique value in the data, counting how many times this value appeared. Now, we have many values, so we need to organize the data into intervals. The histogram is the basic plot when intervals are needed, you can use the basic function:
eduwa3=eduwa[complete.cases(eduwa$Reduced.Lunch),]
dataHist=hist(eduwa3$Reduced.Lunch) #saving info in dataHist
The width of each bin (bar) represents an interval of values, while its height the frequency. The histogram shows an asymmetric shape, where the bin with lowest values of the variable (between 0 and 20) are the most common (above 1000).
Of course, ggplot has a version of histograms:
base= ggplot(eduwa3,aes(x = Reduced.Lunch))
h1= base + geom_histogram()
h1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that you do not get the same plot. Let’s see the info from the basic function:
dataHist
## $breaks
## [1] 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320
##
## $counts
## [1] 1011 549 371 166 72 49 30 18 12 8 2 3 1 2
## [15] 1 1
##
## $density
## [1] 2.201655e-02 1.195557e-02 8.079268e-03 3.614983e-03 1.567944e-03
## [6] 1.067073e-03 6.533101e-04 3.919861e-04 2.613240e-04 1.742160e-04
## [11] 4.355401e-05 6.533101e-05 2.177700e-05 4.355401e-05 2.177700e-05
## [16] 2.177700e-05
##
## $mids
## [1] 10 30 50 70 90 110 130 150 170 190 210 230 250 270 290 310
##
## $xname
## [1] "eduwa3$Reduced.Lunch"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
And now see the info that was used in ggplot:
ggplot_build(h1)$data[[1]]
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## y count x xmin xmax density ncount
## 1 590 590 0.00000 -5.189655 5.189655 2.475778e-02 1.000000000
## 2 287 287 10.37931 5.189655 15.568966 1.204319e-02 0.486440678
## 3 271 271 20.75862 15.568966 25.948276 1.137179e-02 0.459322034
## 4 317 317 31.13793 25.948276 36.327586 1.330206e-02 0.537288136
## 5 247 247 41.51724 36.327586 46.706897 1.036470e-02 0.418644068
## 6 183 183 51.89655 46.706897 57.086207 7.679107e-03 0.310169492
## 7 103 103 62.27586 57.086207 67.465517 4.322120e-03 0.174576271
## 8 83 83 72.65517 67.465517 77.844828 3.482874e-03 0.140677966
## 9 54 54 83.03448 77.844828 88.224138 2.265966e-03 0.091525424
## 10 32 32 93.41379 88.224138 98.603448 1.342795e-03 0.054237288
## 11 25 25 103.79310 98.603448 108.982759 1.049058e-03 0.042372881
## 12 23 23 114.17241 108.982759 119.362069 9.651336e-04 0.038983051
## 13 20 20 124.55172 119.362069 129.741379 8.392466e-04 0.033898305
## 14 13 13 134.93103 129.741379 140.120690 5.455103e-04 0.022033898
## 15 11 11 145.31034 140.120690 150.500000 4.615857e-04 0.018644068
## 16 7 7 155.68966 150.500000 160.879310 2.937363e-04 0.011864407
## 17 5 5 166.06897 160.879310 171.258621 2.098117e-04 0.008474576
## 18 7 7 176.44828 171.258621 181.637931 2.937363e-04 0.011864407
## 19 3 3 186.82759 181.637931 192.017241 1.258870e-04 0.005084746
## 20 6 6 197.20690 192.017241 202.396552 2.517740e-04 0.010169492
## 21 1 1 207.58621 202.396552 212.775862 4.196233e-05 0.001694915
## 22 2 2 217.96552 212.775862 223.155172 8.392466e-05 0.003389831
## 23 1 1 228.34483 223.155172 233.534483 4.196233e-05 0.001694915
## 24 0 0 238.72414 233.534483 243.913793 0.000000e+00 0.000000000
## 25 1 1 249.10345 243.913793 254.293103 4.196233e-05 0.001694915
## 26 0 0 259.48276 254.293103 264.672414 0.000000e+00 0.000000000
## 27 1 1 269.86207 264.672414 275.051724 4.196233e-05 0.001694915
## 28 2 2 280.24138 275.051724 285.431034 8.392466e-05 0.003389831
## 29 0 0 290.62069 285.431034 295.810345 0.000000e+00 0.000000000
## 30 1 1 301.00000 295.810345 306.189655 4.196233e-05 0.001694915
## ndensity PANEL group ymin ymax colour fill size linetype alpha
## 1 1.000000000 1 -1 0 590 NA grey35 0.5 1 NA
## 2 0.486440678 1 -1 0 287 NA grey35 0.5 1 NA
## 3 0.459322034 1 -1 0 271 NA grey35 0.5 1 NA
## 4 0.537288136 1 -1 0 317 NA grey35 0.5 1 NA
## 5 0.418644068 1 -1 0 247 NA grey35 0.5 1 NA
## 6 0.310169492 1 -1 0 183 NA grey35 0.5 1 NA
## 7 0.174576271 1 -1 0 103 NA grey35 0.5 1 NA
## 8 0.140677966 1 -1 0 83 NA grey35 0.5 1 NA
## 9 0.091525424 1 -1 0 54 NA grey35 0.5 1 NA
## 10 0.054237288 1 -1 0 32 NA grey35 0.5 1 NA
## 11 0.042372881 1 -1 0 25 NA grey35 0.5 1 NA
## 12 0.038983051 1 -1 0 23 NA grey35 0.5 1 NA
## 13 0.033898305 1 -1 0 20 NA grey35 0.5 1 NA
## 14 0.022033898 1 -1 0 13 NA grey35 0.5 1 NA
## 15 0.018644068 1 -1 0 11 NA grey35 0.5 1 NA
## 16 0.011864407 1 -1 0 7 NA grey35 0.5 1 NA
## 17 0.008474576 1 -1 0 5 NA grey35 0.5 1 NA
## 18 0.011864407 1 -1 0 7 NA grey35 0.5 1 NA
## 19 0.005084746 1 -1 0 3 NA grey35 0.5 1 NA
## 20 0.010169492 1 -1 0 6 NA grey35 0.5 1 NA
## 21 0.001694915 1 -1 0 1 NA grey35 0.5 1 NA
## 22 0.003389831 1 -1 0 2 NA grey35 0.5 1 NA
## 23 0.001694915 1 -1 0 1 NA grey35 0.5 1 NA
## 24 0.000000000 1 -1 0 0 NA grey35 0.5 1 NA
## 25 0.001694915 1 -1 0 1 NA grey35 0.5 1 NA
## 26 0.000000000 1 -1 0 0 NA grey35 0.5 1 NA
## 27 0.001694915 1 -1 0 1 NA grey35 0.5 1 NA
## 28 0.003389831 1 -1 0 2 NA grey35 0.5 1 NA
## 29 0.000000000 1 -1 0 0 NA grey35 0.5 1 NA
## 30 0.001694915 1 -1 0 1 NA grey35 0.5 1 NA
The first ‘x’ was 0 in ggplot, while it was 10 (in $mids) in the base graphic; from there on everything changed. And not only that, you have 16 bins in the base graphic, while you got 30 in ggplot.
Of course, you can alter that in both alternatives.
Below, you can see a version where both plots are the same:
#ggplot
base= ggplot(eduwa3,aes(x = Reduced.Lunch))
h1= base + geom_histogram(binwidth = 20,boundary=0) #changing width
h1= h1 + stat_bin(binwidth = 20, aes(label=..count..),
geom = "text",boundary = 0,vjust=-0.5)
h1
# base
hist(eduwa3$Reduced.Lunch,labels = T,xlab="Reduced Lunch")
Of course, you can make it a litle better:
hist(eduwa3$Reduced.Lunch,labels = T,xlab="Reduced Lunch", xaxt="n")
axis(side=1, at=dataHist$breaks) # showing axis labels better
As mentioned before, we are plotting intervals, so the accompanying table can be built. For that, we first create the intervals into another variable:
eduwa3$redLunchOrd=cut(eduwa3$Reduced.Lunch,
breaks = dataHist$breaks,
include.lowest = T,
ordered_result = T)
And, as before, we use the freq function:
# no need to show count of NAs:
freq(eduwa3$redLunchOrd,style = 'rmarkdown',report.nas = F)
Variable: eduwa3$redLunchOrd
Type: Factor (ordered)
| Â | Freq | % | % Cum. |
|---|---|---|---|
| [0,20] | 1011 | 44.03 | 44.03 |
| (20,40] | 549 | 23.91 | 67.94 |
| (40,60] | 371 | 16.16 | 84.10 |
| (60,80] | 166 | 7.23 | 91.33 |
| (80,100] | 72 | 3.14 | 94.47 |
| (100,120] | 49 | 2.13 | 96.60 |
| (120,140] | 30 | 1.31 | 97.91 |
| (140,160] | 18 | 0.78 | 98.69 |
| (160,180] | 12 | 0.52 | 99.22 |
| (180,200] | 8 | 0.35 | 99.56 |
| (200,220] | 2 | 0.09 | 99.65 |
| (220,240] | 3 | 0.13 | 99.78 |
| (240,260] | 1 | 0.04 | 99.83 |
| (260,280] | 2 | 0.09 | 99.91 |
| (280,300] | 1 | 0.04 | 99.96 |
| (300,320] | 1 | 0.04 | 100.00 |
| Total | 2296 | 100.00 | 100.00 |
Exercise:
Make a histogram for the variable FREE LUNCH, and make sure it has all the right elements, and get rid of unnecessary elements.
A simplistic idea of measurement tells you the times a particular unit is present in the unit of analysis; which allows for the presence of decimal places. There are variables that can have negative values.
Let’s analyze the variable Student.Teacher.Ratio, but organized by county:
# tapply(variable,group,functionToApply)
tapply(eduwa$Student.Teacher.Ratio, eduwa$County, mean)
## Adams Asotin Benton Chelan Clallam
## NA NA NA NA NA
## Clark Columbia Cowlitz Douglas Ferry
## NA NA NA NA NA
## Franklin Garfield Grant Grays Harbor Island
## NA 17.35000 NA NA NA
## Jefferson King Kitsap Kittitas Klickitat
## NA NA NA NA NA
## Lewis Lincoln Mason Okanogan Pacific
## NA 11.56000 NA NA NA
## Pend Oreille Pierce San Juan Skagit Skamania
## 15.47778 NA NA NA 16.37000
## Snohomish Spokane Stevens Thurston Wahkiakum
## NA NA NA NA 18.15000
## Walla Walla Whatcom Whitman Yakima
## NA NA NA NA
Above, I tried to compute the mean for each county, but the function mean() outputs a missing value (NA) as the result when there is one NA in the column:
# strategy 1: remove missing before computing function: na.rm=T
tapply(eduwa$Student.Teacher.Ratio, eduwa$County, mean,na.rm=T)
## Adams Asotin Benton Chelan Clallam
## 14.80769 19.11111 20.38947 18.65000 19.27600
## Clark Columbia Cowlitz Douglas Ferry
## 19.19744 11.30000 20.38974 16.47222 16.82727
## Franklin Garfield Grant Grays Harbor Island
## 18.10323 17.35000 17.52449 16.33846 19.53000
## Jefferson King Kitsap Kittitas Klickitat
## 16.57857 18.80203 19.24030 18.93529 14.82632
## Lewis Lincoln Mason Okanogan Pacific
## 18.93902 11.56000 16.76316 19.73226 22.39375
## Pend Oreille Pierce San Juan Skagit Skamania
## 15.47778 18.77931 15.90000 22.47209 16.37000
## Snohomish Spokane Stevens Thurston Wahkiakum
## 20.49239 18.13542 19.61765 19.29231 18.15000
## Walla Walla Whatcom Whitman Yakima
## 15.55652 23.77213 13.27083 19.97500
Of course, you can clean first:
# strategy 2:
eduwa4=eduwa[complete.cases(eduwa$Student.Teacher.Ratio),]
tapply(eduwa4$Student.Teacher.Ratio,
eduwa4$County,
mean)
## Adams Asotin Benton Chelan Clallam
## 14.80769 19.11111 20.38947 18.65000 19.27600
## Clark Columbia Cowlitz Douglas Ferry
## 19.19744 11.30000 20.38974 16.47222 16.82727
## Franklin Garfield Grant Grays Harbor Island
## 18.10323 17.35000 17.52449 16.33846 19.53000
## Jefferson King Kitsap Kittitas Klickitat
## 16.57857 18.80203 19.24030 18.93529 14.82632
## Lewis Lincoln Mason Okanogan Pacific
## 18.93902 11.56000 16.76316 19.73226 22.39375
## Pend Oreille Pierce San Juan Skagit Skamania
## 15.47778 18.77931 15.90000 22.47209 16.37000
## Snohomish Spokane Stevens Thurston Wahkiakum
## 20.49239 18.13542 19.61765 19.29231 18.15000
## Walla Walla Whatcom Whitman Yakima
## 15.55652 23.77213 13.27083 19.97500
Great!
Now let me plot a histogram of those means:
# keeping strategy 2:
meanValues=tapply(eduwa4$Student.Teacher.Ratio,
eduwa4$County,
mean)
hist(meanValues)
Let’s compute some statistics:
summary(meanValues)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.30 16.42 18.65 17.96 19.41 23.77
You can use that info, for example, to plot the mean as a reference line:
#reference line
hist(meanValues)
abline(v=mean(meanValues),lty=3,lwd=3,col='blue')
Measurements are continuous values, then a density plot is more appealing to its nature:
mvDense=density(meanValues)
plot(mvDense,main="Title",col='black',xlab=NA)
abline(v=mean(meanValues),lty=3,lwd=3,col='blue') #mean
abline(v=median(meanValues),lty=3,lwd=3,col='red')#median
legend(x="right",
legend=c('mean','median'),
fill = c('blue','red'),bty = 'n') #no box in the legend
A box plot is always welcome, specially considering that it does not need reference lines. Take a look:
bp=boxplot(meanValues,horizontal = T,ylim=c(5,30))
Our plots for the mean values have a more symmetrical shape. This happens when you get mean values of groups, showing a tendency towards a bell-shaped distribution, which is ideally known as the Gauss or Normal distribution.
Notice also that boxplots serve to detect atypical values (outliers), which I saved in bp:
bp$out
## Columbia Lincoln
## 11.30 11.56
We could annotate the boxplot like this:
boxplot(meanValues,horizontal = T,ylim=c(5,30))
text(x= 10, y= 0.8, labels= "Outliers are:",col='gray')
text(x= 10, y= 0.75,
labels= paste(names(bp$out)[1], 'and', names(bp$out)[2]),
col='gray')
In general, measurements and counts are prone to have outliers. It is not common to speak about outliers in categorical data since they have few levels; however, if they had many levels, we could find outliers if the variable is ordinal.
From what I said above, the subjective side of finding outliers lies in the decision of what is normal. In the case of the boxplot, the decision has been to accept as normal the values that have a prudent distance from the first or last quartile. This distance is 1.5 times the difference between the quartiles (a.k.a. Interquartle Range or IQR). Then, if a outlier is found, the whisker ends in a position different than the actual minimum or maximal value of the data.
Exercise:
Do some research and make a histogram and a density plot using ggplot for the variable we just used above.